This is the Opportunity Mapping 2.0 technical document produced by Phuong Tseng. Its purpose is to capture the changes and developments in the 2019 version.
In 2019 there are five domains: education, economic & mobility, housing and neighborhood, conduit, and social capital. The social capital domain is new in 2019.
This year the education domain gained a new indicator, the Early Childhood Participation Rate (Pre-K). Another indicator, adults with a bachelor's degree, was moved from the education domain to the economic & mobility domain in 2019.
common_fields <- c("fips",
"CountyID.x",
"TOTPOP.x", "county_name.x")
edu_list <-
c(
"math_prof",
"read_prof",
"grad_rate",
"pct_not_frpm",
"z_math_prof",
"z_read_prof",
"z_grad_rate",
"az_pct_not_frpm",
"HD01_VD04",
"HD01_VD03",
"ratio",
"ratio2",
"z_preK"
)
There are a few changes to this domain in 2019. Adults with a bachelor's degree, median household income, and median home value were added to this domain. Other indicators, such as the commuting-time and entry-level-jobs measures, were changed to TCAC's measures. A new indicator, school district revenue per capita, was added to capture the extent of municipal hoarding; due to reliability issues with municipal data, school district boundaries were used as a proxy.
econ_list <- c(
  "total_pop_2017",
  "below_200_pov_2017.x",
  "moe_below_200_pov_2017.x",
  "pct_below_pov_2017",
  "moe_pct_below_pov_2017",
  "pct_below_200_pov_2017.x",
  "pct_assist_2017",
  "med_hhincome_2017",
  "moe_med_hhincome_2017",
  "employed_pop_20to60_2017",
  "pct_employed_20to60_2017",
  "home_value_2017",
  "moe_home_value_2017",
  "pct_bachelors_plus_2017",
  "above_200_pov_2017",
  "pct_above_200_pov_2017",
  "tot_hh_2017",
  "moe_tot_hh_2017",
  "moe_pct_long_commute_2017",
  "moe_assist_2017",
  "moe_long_commute_pct",
  "long_commute_pct",
  "low_wage_med_distance",
  "jobs_lowed",
  "rural_flag",
  "az_pct_assist_2017",
  "az_pct_employed_20to60_2017",
  "z_home_value_2017",
  "z_pct_bachelors_plus_2017",
  "az_pct_long_commute_2017",
  "z_jobs_lowed",
  "Econ_Domain",
  "z_sdrevpcap",
  "sdrev",
  "sdrevpcap",
  "sd_totpop"
)
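The last four names in the list suggest how the school-district revenue indicator is built: total district revenue divided by district population, then standardized. A minimal sketch with toy values; the `scale()` standardization step is our assumption about how `z_sdrevpcap` is derived, and the actual z-scores may be computed differently:

```r
# Toy districts: sdrevpcap = sdrev / sd_totpop, then standardized.
toy <- data.frame(sdrev = c(5e6, 2e6, 8e6), sd_totpop = c(1e4, 8e3, 1e4))
toy$sdrevpcap   <- toy$sdrev / toy$sd_totpop          # revenue per capita: 500, 250, 800
toy$z_sdrevpcap <- as.numeric(scale(toy$sdrevpcap))   # assumed z-score step
```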
The housing and neighborhood opportunity domain has two new environmental indicators pulled from CalEnviroScreen: PM2.5 (pm25) and lead (lead_pctl).
housing_list <- c(
  "below_200_pov_2017.y",
  "moe_below_200_pov_2017.y",
  "pct_below_200_pov_2017.y",
  "pm25",
  "pct_pm25",
  "toxRelease",
  "pct_toxRelease",
  "lead_pctl",
  "pct_lead_pctl",
  "Grocery",
  "z_Grocery",
  "az_Grocery",
  "P_INSURED",
  "az_insurance",
  "H_Crime",
  "pct_parks",
  "az_pct_below_200_pov_2017",
  "az_pct_below_200_pov_20172",
  "az_pct_pm25",
  "az_pct_toxRelease",
  "az_pct_lead_pctl",
  "Housing_Env_Domain",
  "test_azcrime",
  "azhealthcare",
  "zparks"
)
The Conduit domain has two indicators: median broadband download speed and percentage of single-parent households.
conduit_list <-
c(
"pct_singleparent_hh_2017.y",
"moe_pct_singleparent_hh_2017.y",
"az_pct_singleparent_hh_2017",
"TOTPOP.y",
"Median_bb",
"z_broadband",
"z_broadband2",
"Conduit"
)
source(here::here("myfunction",'compile_function.R'))
data <- compile_function(
data = data,
common_fields = common_fields,
a_list = edu_list,
b_list = econ_list,
c_list = housing_list,
d_list = conduit_list,
e_list = socap_list
)
data$edu_domain <- rowSums(data[, c("z_preK","z_math_prof","z_read_prof","z_grad_rate","az_pct_not_frpm")], na.rm=TRUE)
data$edu_domain <- data$edu_domain/5
data$Socap_domain <- rowSums(data[, c("z_regvoter", "zclubs", "zreligious")], na.rm=TRUE)
data$Socap_domain <- data$Socap_domain/3
data$Conduit_domain <- rowSums(data[,c("az_pct_singleparent_hh_2017", "z_broadband")],na.rm=TRUE)
data$Conduit_domain <- data$Conduit_domain/2
data$econ_domain <- rowSums(data[,c("z_jobs_lowed", "az_pct_long_commute_2017", "z_sdrevpcap", "z_home_value_2017", "az_pct_assist_2017", "z_pct_bachelors_plus_2017", "az_pct_employed_20to60_2017")],na.rm=TRUE)
data$econ_domain <- data$econ_domain/7
data$housing_domain <- rowSums(data[,c("test_azcrime", "zparks", "az_Grocery", "az_pct_toxRelease", "az_insurance", "az_pct_lead_pctl", "pct_below_200_pov_2017.x", "az_pct_pm25", "azhealthcare")],na.rm=TRUE)
data$housing_domain <- data$housing_domain/9
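Each domain above follows the same pattern: `rowSums(..., na.rm = TRUE)` divided by the number of indicators. Note that this treats a missing indicator as a zero contribution; `rowMeans(..., na.rm = TRUE)` would instead average only the non-missing indicators. A sketch of both behaviors (the helper name `domain_mean` is ours):

```r
# Hypothetical helper: average a set of z-score columns into a domain score.
# rowSums(..., na.rm = TRUE) / k (as above) treats NAs as zeros;
# rowMeans(..., na.rm = TRUE) averages over non-missing values instead.
domain_mean <- function(data, cols, treat_na_as_zero = TRUE) {
  m <- as.matrix(data[, cols])
  if (treat_na_as_zero) {
    rowSums(m, na.rm = TRUE) / length(cols)  # matches the code above
  } else {
    rowMeans(m, na.rm = TRUE)
  }
}

# Toy check with two tracts and three indicators
toy <- data.frame(a = c(1, NA), b = c(2, 2), c = c(3, 4))
domain_mean(toy, c("a", "b", "c"))                             # 2 and 2
domain_mean(toy, c("a", "b", "c"), treat_na_as_zero = FALSE)   # 2 and 3
```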
#data$fips <- as.character(data$fips)
#data$CountyID <- as.character(data$CountyID.x)
#data$fips <- ifelse(length(data$fips!=11), paste0(0, data$fips), data$fips)
data$index <- (data$housing_domain + data$edu_domain + data$econ_domain + data$Socap_domain + data$Conduit_domain)/5
summary(data$index)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.757397 -0.130434 -0.006168 0.004954 0.119239 1.180872
#df <- data %>% select(fips,housing_domain,edu_domain,econ_domain,Socap_domain, Conduit_domain, index) %>% filter(is.na(index))
The single-parent-family filter below returns 471 records, with 8 NAs.
#filter_function <- function(data, variable1, variable2, value, value2){
#  data$variable1[which(data$variable2)] <- value2
#  return(data$variable1)
#}
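The commented draft above can be completed into a small helper that reproduces the single-parent and poverty flags built next (the function name `threshold_flag` is ours):

```r
# Flag a tract -1 when `pct` meets or exceeds `threshold`, 0 otherwise,
# with NAs falling through to 0 (the same NA fix applied below).
threshold_flag <- function(pct, threshold = 0.3) {
  flag <- ifelse(pct >= threshold, -1, 0)
  ifelse(is.na(flag), 0, flag)
}

threshold_flag(c(0.1, 0.3, 0.45, NA))  # 0 -1 -1 0
```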
data$SPF_GT_30[which(data$pct_singleparent_hh_2017.y<0.3)] <- 0
data$SPF_GT_30[which(data$pct_singleparent_hh_2017.y>=0.3)] <- -1 #471 records
data$flag_spf <- ifelse(is.na(data$SPF_GT_30), 0, data$SPF_GT_30)
summary(data$flag_spf) #fixed NAs to 0, 471 records
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.0000 -1.0000 0.0000 -0.2983 0.0000 0.0000
The poverty filter below returns 418 records, with 3 NAs.
data$POVR200_GT_30[which(data$pct_below_200_pov_2017.x<0.3)] <- 0
data$POVR200_GT_30[which(data$pct_below_200_pov_2017.x>=0.3)] <- -1 #418
data$POVR200_GT_30 <- ifelse(is.na(data$POVR200_GT_30), 0, data$POVR200_GT_30)
summary(data$POVR200_GT_30) #418
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.0000 -1.0000 0.0000 -0.2647 0.0000 0.0000
data$SPF30_P30[which(data$flag_spf==0 | data$POVR200_GT_30==0)] <- 0
data$SPF30_P30[which(data$flag_spf==-1 & data$POVR200_GT_30==-1)] <- -1
summary(data$SPF30_P30) #fixed NAs to 0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.0000 0.0000 0.0000 -0.1849 0.0000 0.0000
sum(data$SPF30_P30) #292
## [1] -292
load(here("data", "input_DI.RData"))
df <- data
df$fips <- as.character(df$fips)
df$fips <- paste0(0,df$fips)
dat <- merge(input_DI,df, by="fips")
data <- dat
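The `paste0(0, df$fips)` step above assumes every FIPS code lost exactly one leading zero in a prior numeric conversion. A width-based pad is safer when some codes may already be 11 characters (the helper name `pad_fips` is ours):

```r
# Left-pad FIPS codes with zeros up to a fixed width (11 for tract FIPS).
pad_fips <- function(fips, width = 11) {
  f <- as.character(fips)
  pad <- strrep("0", pmax(0, width - nchar(f)))
  paste0(pad, f)  # already-full codes are left unchanged
}

pad_fips(c("6081984300", "06095253000"))  # "06081984300" "06095253000"
```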
data$Flag_HighDI_Blk_Lat[which(data$Black_Latinx<=0.5 | data$divergence_thresh<3)] <- 0
data$Flag_HighDI_Blk_Lat[which(data$Black_Latinx>0.5 & data$divergence_thresh==3)] <- -1
sum(data$Flag_HighDI_Blk_Lat) #201 no NAs
## [1] -201
#save.image(file="save_test.RData")
#load(file="save_test.RData")
sum(data$Flag_HighDI_Blk_Lat) #201 no NAs
## [1] -201
sum(data$POVR200_GT_30) #418
## [1] -418
#test <- test_function(data = data, variable1 = data$Flag_HighDI_Blk_Lat_POV30, variable2 = data$Flag_HighDI_Blk_Lat, variable3 = data$POVR200_GT_30,value1=-1,value0=0)
data$Flag_HighDI_Blk_Lat_POV30[which((data$Flag_HighDI_Blk_Lat==0) | (data$POVR200_GT_30 == 0))] <- 0
data$Flag_HighDI_Blk_Lat_POV30[which((data$Flag_HighDI_Blk_Lat==-1) & (data$POVR200_GT_30 == -1))] <- -1
sum(data$Flag_HighDI_Blk_Lat_POV30) #171 records
## [1] -171
#data$DI_Blk_Lat_POV30[which((data$Black_Latinx < 0.5 & (data$divergence_thresh < 3)) | (data$POVR200_GT_30 == 0))] <- 0
#data$DI_Blk_Lat_POV30[which((data$Black_Latinx > 0.5 & data$divergence_thresh == 3) | (data$POVR200_GT_30 == -1))] <- -1
#320
#sum(data$DI_Blk_Lat_POV30) #WHY this generated 418 NAs
#returned 448?
#filter_na <- data %>% select(fips, county_name,DI_Blk_Lat_POV30,Black_Latinx,divergence_thresh,POVR200_GT_30) %>% filter(is.na(data$DI_Blk_Lat_POV30))
summary(data$POVR200_GT_30) #418 no NAs
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.0000 -1.0000 0.0000 -0.2647 0.0000 0.0000
High divergence with a Black and Latinx population > 50% and poverty (below 200% FPL) >= 30%, OR poverty (below 200% FPL) >= 30% and single-parent families >= 30%:
data$DI_Blk_Lat_POV30_OR_POV30_SPF30[which((
data$Flag_HighDI_Blk_Lat == 0 &
data$POVR200_GT_30 == 0
) |
(
data$POVR200_GT_30 == 0 &
data$pct_singleparent_hh_2017.y < 0.3
)
)] <- 0
data$DI_Blk_Lat_POV30_OR_POV30_SPF30[which((
data$Flag_HighDI_Blk_Lat == -1 &
data$POVR200_GT_30 == -1
) |
(
data$POVR200_GT_30 == -1 &
data$pct_singleparent_hh_2017.y >= 0.3
)
)] <- -1
data$DI_Blk_Lat_POV30_OR_POV30_SPF30 <-
ifelse(
is.na(data$DI_Blk_Lat_POV30_OR_POV30_SPF30),
0,
data$DI_Blk_Lat_POV30_OR_POV30_SPF30
) #fixed NAs
sum(data$DI_Blk_Lat_POV30_OR_POV30_SPF30) #320 records
## [1] -320
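The three-step assignment above (set 0, set -1, fix NAs) can be collapsed into one vectorized expression; the toy frame below exercises both branches plus an NA that should fall through to 0:

```r
# Toy tracts: both flags set; only the poverty/single-parent pair set;
# and an NA single-parent rate that must resolve to 0.
toy <- data.frame(
  Flag_HighDI_Blk_Lat        = c(-1, 0, 0),
  POVR200_GT_30              = c(-1, -1, -1),
  pct_singleparent_hh_2017.y = c(0.1, 0.35, NA)
)
cond <- with(toy,
  (Flag_HighDI_Blk_Lat == -1 & POVR200_GT_30 == -1) |
  (POVR200_GT_30 == -1 & pct_singleparent_hh_2017.y >= 0.3))
toy$flag <- ifelse(cond %in% TRUE, -1, 0)  # NA comparisons count as 0
toy$flag  # -1 -1 0
```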
#1256 records instead of 1259 records after removing 3 NAs
#quantile(remaining$index, prob = c(0.25, .5, .75))
remaining$category <- "Opportunity"
#Top 75.01%
remaining$category[which(remaining$index > quantile(remaining$index, prob = .75))] <- "Highest Opportunity" #314
#Between 50.01% - 75.00%
remaining$category[which(remaining$index > quantile(remaining$index, prob = .50) & remaining$index <= quantile(remaining$index, prob = .75))] <- "High Opportunity" #314
#Between 25.01% - 50.00%
remaining$category[which(remaining$index > quantile(remaining$index, prob = .25) & remaining$index <= quantile(remaining$index, prob = .50))] <- "Moderate Opportunity" #314
#Bottom 25%
remaining$category[which(remaining$index <= quantile(remaining$index, prob = .25))] <- "Low Opportunity" #314
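The four quantile assignments above can be collapsed with `cut()`. This sketch mirrors the strict `>` boundaries used above via left-open intervals, and assumes the quartile breaks are distinct (`cut()` errors on duplicate breaks):

```r
# Bucket an index into the four opportunity categories by quartile.
categorize_quartiles <- function(index) {
  cut(index,
      breaks = quantile(index, probs = c(0, .25, .5, .75, 1)),
      labels = c("Low Opportunity", "Moderate Opportunity",
                 "High Opportunity", "Highest Opportunity"),
      include.lowest = TRUE)
}

table(categorize_quartiles(1:8))  # two values fall in each bucket
```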
#Join the datasets with filters
df <- Reduce(full_join, list(remaining, categorize, nan_records))
#data$index <- ifelse(is.na(data$index), 0, data$index)
#Top 20% (80-100%)
data$category_wo_filters[which((data$index > quantile(data$index, prob = .80)))] <- "Highest Opportunity"
#Between 60-80%
data$category_wo_filters[which(data$index >= quantile(data$index, prob = .60) & data$index <= quantile(data$index, prob =.80))] <- "High Opportunity"
#Bottom 20%
data$category_wo_filters[which(data$index <= quantile(data$index, prob = 0.20))] <- "Lowest Opportunity"
#Between 40-60%
data$category_wo_filters[which(data$index < quantile(data$index, prob =.60) & data$index > quantile(data$index, prob = .40))] <- "Moderate Opportunity"
#Between 20-40%
data$category_wo_filters[which(data$index <= quantile(data$index, prob = .40) & data$index >= quantile(data$index, prob = 0.20))] <- "Low Opportunity"
new_df <- merge(data, df, by = intersect(names(data), names(df)), all.x = TRUE)
#names <- intersect(names(data), names(df))
#names(new_df) <- sub("^X", "", names(z))
#make.names(names(z))
#head(new_df)
#write.csv(data, "data.csv")
categories <- fread(here("output", "categories.csv"))
one <- ggplot2::ggplot(categories, aes(x = index.x, y = category, color = category)) + geom_point()
one+facet_grid(.~category)
#ggsave("index_category", "index_category.png")
three <-
  ggplot(df, aes(x = cbsa, fill = category)) + geom_bar(position = "dodge")
ggplotly(three)
#Map It
#library(leaflet)
#t <- colorQuantile("YlOrRd", NULL, n = 10)
#leaflet(categories) %>%
# addTiles() %>%
# addCircleMarkers(color = ~t(tann))
p <-
  ggplot(df, aes(x = county_name, fill = category)) + geom_bar(position = "dodge")
ggplotly(p)
d <- df[sample(nrow(df), 1500), ]  # random sample of 1,500 records for plotting
plot_ly(d, x = ~category, y = ~index,
        text = ~paste("Category:", category),
        type = "scatter", mode = "markers", color = ~category, size = ~index)
These are the records with NAs or missing values:
1. fips 06081984300 has NaN in pct_pov_below_200 and pct_singleparent_hh; its category (Moderate) changed to NA
2. fips 06095253000 has NaN in pct_pov_below_200 and pct_singleparent_hh; its category (Highest) changed to NA
3. fips 06095980000 has NaN in pct_pov_below_200 and pct_singleparent_hh; its category (High) changed to NA
Data Source: ACS Census data 2010-2014
Description: To analyze the distribution of racial and ethnic composition, I joined the shapefile using the ‘GEOID’ field to match it with the GEOID in the opportunity categories shapefile.
race_list <- c("fips", "CountyID.x.x", "TOTPOP.x.x", "total_pop","index","total_pop","white","black", "asian", "hispanic", "other", "county_name","cbsa","divergence_thresh", "DI_Blk_Lat_POV30_OR_POV30_SPF30")
#filter_race <- function(data, r_list){
# daf <- data %>% select(r_list)
# assign(daf, envir=.GlobalEnv)
#}
#source(here(filter_race.R))
#filter_race(data=index_filters, r_list=race_list)
Data Source: American Community Survey (5-year-estimates)
Table: B19013_001 – MEDIAN HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2017 INFLATION-ADJUSTED DOLLARS)
Data Source: ESRI Business Analyst
Spreadsheet: OV_YEAR_Payday
Description: 2017 Measure – Spatially join the Bay Area payday lending shapefile to the 2014 census tract shapefile with the opportunity categories to obtain the number of businesses per census tract. Then divide the count of businesses per tract by the total number of payday lending and credit businesses in the Bay Area to obtain the percentage.
2018 Measure – Identify whether the salevolume column in the dataset holds the volume of payday loan sales. Aggregate those sales and distribute them to tracts to identify the amount of sales in each neighborhood, OR (if possible) identify where the loans with the highest interest rates (200-400%) are located and how many fall in each census tract.
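The per-tract share in the 2017 measure can be sketched with a counts table. The `tract` column below stands in for the tract ID carried over by the spatial join (the column name and toy values are hypothetical):

```r
# One row per payday lender after the spatial join; tract is the joined tract ID.
payday <- data.frame(tract = c("A", "A", "B", "C"))

# Count lenders per tract, then divide by the regional total to get the share.
counts <- as.data.frame(table(payday$tract))
names(counts) <- c("tract", "n")
counts$pct <- counts$n / sum(counts$n)   # shares: 0.50, 0.25, 0.25
```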
#load(file="BA_payday_2018.RData")
#proj4string(BA_payday_2018)
Data Source: HUD subsidized housing projects
Spreadsheet: OV_Year_SubHous
Description:
• Data should be gathered through HUD instead of TCAC. Use the file obtained from HUD to create a point shapefile based on the latitude and longitude of each project (included in the table).
• The table lists all subsidized housing projects in California; use geoprocessing to clip the subsidized housing shapefile to the Bay Area.
• The analysis of projects and units should be included in the map, based on the subsidized units available and the number of subsidized programs in the region.
Data Source: Census Data
Spreadsheet: OV_Year_LowDen
Description: To analyze the density of each census tract and identify areas considered low density, defined as 40 or more acres per person.
• Calculate the area of each tract in acres, then divide it by the number of people; the results are stored in the POP_DEN field. All tracts with a value of 40 or above were highlighted on the map with a specific symbology.
Example:
Step 1: Create a new field, “Acres_per”, for each tract > Calculate Geometry > select Area > Coordinate System: Use the coordinate system of the data frame: PCS: NAD 1983 StatePlane California III FIPS 0403 > Units: Acres [US] (ac) > OK
Step 2: Then create a new field titled “POP_DEN”, whose value is “Acres_per” for each tract divided by the number of people in the tract > select the tracts with a value of 40 or above
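The two steps can be sketched in R, assuming each tract's area is available in square meters from a projected shapefile (1 acre = 4046.8564224 m²); the toy values are hypothetical:

```r
# Toy tracts with area in square meters and population.
tracts <- data.frame(area_sqm = c(2.5e8, 1.0e6), pop = c(1200, 900))

tracts$acres       <- tracts$area_sqm / 4046.8564224  # Step 1: area in acres
tracts$POP_DEN     <- tracts$acres / tracts$pop       # Step 2: acres per person
tracts$low_density <- tracts$POP_DEN >= 40            # flag 40+ acres per person
tracts$low_density  # TRUE FALSE
```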
5. Social Capital
This is our newest domain. It includes the average distance to a religious institution, the registered-voter voting rate, and the average distance to club memberships, among other indicators.